Setup

library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)

theme_set(bayesplot::theme_default())

# seed for R's pseudo-RNGs, not Stan's
set.seed(1123) 

# load data
pest_data <- readRDS('data/pest_data.RDS')
standata_hier <- readRDS('data/standata_hier.RDS')

The problem

Background

Imagine that you are a statistician or data scientist working as an independent contractor. One of your clients is a company that owns many residential buildings throughout New York City. The property manager explains that they are concerned about the number of cockroach complaints that they receive from their buildings. Previously the company has offered monthly visits from a pest inspector as a solution to this problem. While this is the default solution of many property managers in NYC, the tenants are rarely home when the inspector visits, and so the manager reasons that this is a relatively expensive solution that is currently not very effective.

One alternative to this problem is to deploy long term bait stations. In this alternative, child and pet safe bait stations are installed throughout the apartment building. Cockroaches obtain quick acting poison from these stations and distribute it throughout the colony. The manufacturer of these bait stations provides some indication of the space-to-bait efficacy, but the manager suspects that this guidance was not calculated with NYC roaches in mind. NYC roaches, the manager rationalizes, have more hustle than traditional roaches, and NYC buildings are built differently than other common residential buildings in the US. This is particularly important as the unit cost for each bait station per year is quite high.

The goal

The manager wishes to employ your services to help them to find the optimal number of roach bait stations they should place in each of their buildings in order to minimize the number of cockroach complaints while also keeping expenditure on pest control affordable.

A subset of the company’s buildings have been randomly selected for an experiment:

  • At the beginning of each month, a pest inspector randomly places a number of bait stations throughout the building, without knowledge of the current cockroach levels in the building
  • At the end of the month, the manager records the total number of cockroach complaints in that building.
  • The manager would like to determine the optimal number of bait stations (we’ll call them \(\textrm{traps}\)) that balances the lost revenue (\(R\)) that complaints (\(\textrm{complaints}\)) generate with the all-in cost of maintaining the traps (\(\textrm{C}\)).

Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.

Formally, we are interested in finding

\[ \arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{C}(\textrm{traps})] \]

The property manager would also, if possible, like to learn how these results generalize to buildings they haven’t treated so they can understand the potential costs of pest control at buildings they are acquiring as well as for the rest of their building portfolio.

As the property manager has complete control over the number of traps set, the random variable contributing to this expectation is the number of complaints given the number of traps. We will model the number of complaints as a function of the number of traps.

The data

The data provided to us is in a file called pest_data.RDS. Let’s load the data and see what the structure is:

pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
'data.frame':   120 obs. of  13 variables:
 $ building_id         : int  37 37 37 37 37 37 37 37 37 37 ...
 $ date                : Date, format: "2017-01-15" "2017-02-14" ...
 $ traps               : num  8 8 9 10 11 11 10 10 9 9 ...
 $ floors              : num  8 8 8 8 8 8 8 8 8 8 ...
 $ sq_footage_p_floor  : num  5149 5149 5149 5149 5149 ...
 $ live_in_super       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ monthly_average_rent: num  3847 3847 3847 3847 3847 ...
 $ average_tenant_age  : num  53.9 53.9 53.9 53.9 53.9 ...
 $ age_of_building     : num  47 47 47 47 47 47 47 47 47 47 ...
 $ total_sq_foot       : num  41192 41192 41192 41192 41192 ...
 $ month               : num  1 2 3 4 5 6 7 8 9 10 ...
 $ complaints          : num  1 3 0 1 0 0 4 3 2 2 ...
 $ log_sq_foot_1e4     : num  1.42 1.42 1.42 1.42 1.42 ...

We have access to the following fields:

First, let’s see how many buildings we have in the data:

length(unique(pest_data$building_id))
[1] 10

And make some plots of the raw data:

ggplot(pest_data, aes(x = complaints)) + 
  geom_bar()

ggplot(pest_data, aes(x = traps, y = complaints)) + 
  geom_jitter()

ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) + 
  geom_line(aes(linetype = "Number of complaints")) + 
  geom_point(color = "black") + 
  geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) + 
  facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) + 
  scale_x_date(name = "Month", date_labels = "%b") + 
  scale_y_continuous(name = "", limits = range(pest_data$complaints)) + 
  scale_linetype_discrete(name = "") + 
  scale_color_discrete(name = "Live-in super")

The first question we’ll look at is just whether the number of complaints per building per month is associated with the number of bait stations per building per month, ignoring the temporal and across-building variation (we’ll get to that later). This requires only two variables, \(\textrm{complaints}\) and \(\textrm{traps}\). How should we model the number of complaints?

Modeling count data : Poisson distribution

We already know some rudimentary information about what we should expect. The number of complaints over a month should be either zero or an integer. The property manager tells us that it is possible but unlikely that number of complaints in a given month is zero. Occasionally there are a very large number of complaints in a single month. A common way of modeling this sort of skewed, single bounded count data is as a Poisson random variable. One concern about modeling the outcome variable as Poisson is that the data may be over-dispersed, but we’ll start with the Poisson model and then check whether over-dispersion is a problem by comparing our model’s predictions to the data.

Model

Given that we have chosen a Poisson regression, we define the likelihood to be the Poisson probability mass function over the number bait stations placed in the building, denoted below as traps. This model assumes that the mean and variance of the outcome variable complaints (number of complaints) is the same. We’ll investigate whether this is a good assumption after we fit the model.

For building \(b = 1,\dots,10\) at time (month) \(t = 1,\dots,12\), we have

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]

Prior predictive checks

Before we fit the model to real data, we should check that our priors and model can generate plausible simulated data.

In R we can simulate from the prior predictive distribution using a function like this:

# using normal distributions for priors on alpha and beta 
simple_poisson_dpg <- function(traps, alpha_mean, alpha_sd, beta_mean, beta_sd) {
  N <- length(traps)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd);
  beta <- rnorm(1, mean = beta_mean, sd = beta_sd);
  complaints <- rpois(N, lambda = exp(alpha + beta * traps))
  return(complaints)
}

Try with \(N(0,10)\) priors on both parameters:

# you can run this chunk multiple times to keep generating different datasets
simple_poisson_dpg(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 10,
  beta_mean = 0,
  beta_sd = 10
)
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 0 0 0 0

Try with \(N(0,1)\) priors on both parameters:

simple_poisson_dpg(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 1,
  beta_mean = 0,
  beta_sd = 1
)
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 [38] 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
 [75] 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
[112] 0 0 1 0 0 1 1 0 0

Let’s calculate some summary statistics of 1000 data sets generated according to the \(N(0,1)\) priors:

prior_preds <- t(replicate(1000, expr = {
  simple_poisson_dpg(
    traps = pest_data$traps,
    alpha_mean = 0,
    alpha_sd = 1,
    beta_mean = 0,
    beta_sd = 1
  )
}))
Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced

Warning in rpois(N, lambda = exp(alpha + beta * traps)): NAs produced
dim(prior_preds)
[1] 1000  120
summary(apply(prior_preds, 1, mean))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
       0        0        1   297275       96 52806051       17 
summary(apply(prior_preds, 1, min))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   0.000   0.000   1.384   1.000  68.000      17 
summary(apply(prior_preds, 1, max))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
0.000e+00 1.000e+00 4.000e+00 9.548e+06 6.350e+02 1.693e+09        17 

A more reasonable prior:

simple_poisson_dpg(
  traps = pest_data$traps,
  alpha_mean = log(7),
  alpha_sd = 0.5,
  beta_mean = -0.25,
  beta_sd = 0.5
)
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 1 0 0 4

Simulate 1000 times and calculate summary stats:

prior_preds <- t(replicate(1000, expr = {
  simple_poisson_dpg(
    traps = pest_data$traps,
    alpha_mean = log(7),
    alpha_sd = 0.5,
    beta_mean = -0.25,
    beta_sd = 0.5
  )
}))
summary(apply(prior_preds, 1, mean))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       2    8547      15 8014477 
summary(apply(prior_preds, 1, min))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00    3.09    4.00   60.00 
summary(apply(prior_preds, 1, max))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
        0         4         9    232629        30 225139844 

Writing and fitting our first Stan program

Let’s encode the model in a Stan program.

  • Write simple_poisson_regression.stan together, put in stan_programs directory.
comp_simple <- stan_model('stan_programs/simple_poisson_regression.stan')

To fit the model to the data we’ll first create a list to pass to Stan using the variables in the pest_data data frame. The names of the list elements must match the names used in the data block of the Stan program.

standata_simple <- list(
  N = nrow(pest_data), 
  complaints = pest_data$complaints,
  traps = pest_data$traps
)
str(standata_simple)
List of 3
 $ N         : int 120
 $ complaints: num [1:120] 1 3 0 1 0 0 4 3 2 2 ...
 $ traps     : num [1:120] 8 8 9 10 11 11 10 10 9 9 ...

We have already compiled the model, so we can jump straight to sampling from it.

fit_simple <- sampling(comp_simple, data = standata_simple,
                       # these are the defaults but specifying them anyway
                       # so you can see how to use them: 
                       # posterior sample size = chains * (iter-warmup)
                       chains = 4, iter = 2000, warmup = 1000)

SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 3.2e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.32 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.128012 seconds (Warm-up)
Chain 1:                0.113734 seconds (Sampling)
Chain 1:                0.241746 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.2e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.117357 seconds (Warm-up)
Chain 2:                0.115251 seconds (Sampling)
Chain 2:                0.232608 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 1.4e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.127274 seconds (Warm-up)
Chain 3:                0.105287 seconds (Sampling)
Chain 3:                0.232561 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 1.3e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.11258 seconds (Warm-up)
Chain 4:                0.097739 seconds (Sampling)
Chain 4:                0.210319 seconds (Total)
Chain 4: 

Print the summary of the intercept and slope parameters:

print(fit_simple, pars = c('alpha','beta'))
Inference for Stan model: simple_poisson_regression.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha  2.60    0.01 0.16  2.27  2.49  2.60  2.70  2.93   719    1
beta  -0.19    0.00 0.02 -0.25 -0.21 -0.19 -0.18 -0.15   729    1

Samples were drawn using NUTS(diag_e) at Fri Apr 24 17:13:59 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We can also plot the posterior distributions:

# extract posterior samples of alpha and beta
draws <- as.matrix(fit_simple, pars = c('alpha','beta'))

# https://mc-stan.org/bayesplot/reference/MCMC-distributions
mcmc_hist(draws) # marginal posteriors of alpha and beta
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mcmc_scatter(draws, alpha = 0.2, size = 1) # posterior of (alpha,beta)

And compare them to draws from the prior:

alpha_prior_post <- cbind(alpha_prior = rnorm(4000, log(7), 1), 
                          alpha_posterior = draws[, "alpha"])
mcmc_hist(alpha_prior_post, facet_args = list(nrow = 2), binwidth = 0.1) + 
  xlim(range(alpha_prior_post))

beta_prior_post <- cbind(beta_prior = rnorm(4000, -0.25, 0.5), 
                         beta_posterior = draws[, "beta"])
mcmc_hist(beta_prior_post, facet_args = list(nrow = 2), binwidth = 0.05) + 
  xlim(range(beta_prior_post))

From the posterior of beta, it appears the number of bait stations set in a building is associated with the number of complaints about cockroaches that were made in the following month. However, we still need to consider how well the model fits.

Posterior predictive checking

# see http://mc-stan.org/rstan/articles/stanfit_objects.html for various ways
# of extracting contents from stanfit objects
y_rep <- as.matrix(fit_simple, pars = "y_rep")
dim(y_rep) # number of posterior draws x number of data points
[1] 4000  120
# https://mc-stan.org/bayesplot/reference/PPC-distributions#plot-descriptions
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])

In the plot above we have the kernel density estimate of the observed data (\(y\), thicker curve) and 200 simulated data sets (\(y_{rep}\), thin curves) from the posterior predictive distribution. Here the posterior predictive simulations are not as dispersed as the real data and don’t seem to capture the rate of zeros well at all. This suggests the Poisson model may not be sufficiently flexible for this data.

Let’s explore this further by looking directly at the proportion of zeros in the real data and predicted data.

# calculate the proportion of zeros in a vector
prop_zero <- function(x) mean(x == 0)

# https://mc-stan.org/bayesplot/reference/PPC-test-statistics#plot-descriptions
ppc_stat(
  y = standata_simple$complaints,
  yrep = y_rep,
  stat = "prop_zero",
  binwidth = .01
)

The plot above shows the observed proportion of zeros (thick vertical line) and a histogram of the proportion of zeros in each of the simulated data sets. It is clear that the model does not capture this feature of the data well at all.

The rootogram is another useful plot to compare the observed vs expected number of complaints. This is a plot of the square root of the expected counts (continuous line) vs the observed counts (blue histogram)

# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_rootogram(standata_simple$complaints, yrep = y_rep)

If the model was fitting well these would be relatively similar, however in this figure we can see the number of complaints is underestimated if there are few complaints, over-estimated for medium numbers of complaints, and underestimated if there are a large number of complaints.

The ppc_bars() function will make a bar plot of the observed values and overlay prediction intervals but not on the square root scale (unlike the rootogram):

# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_bars(standata_simple$complaints, yrep = y_rep) + 
  labs(x = "Complaints")

We can also view how the predicted number of complaints varies with the number of traps:

ppc_intervals(y = standata_simple$complaints, yrep = y_rep,
              # jitter number of traps since multiple observations at some values
              x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)) +  
  labs(x = "Number of traps", y = "Predicted number of complaints")

Expanding the model: multiple predictors

Currently, our model’s mean parameter is a rate of complaints per 30 days, but we’re modeling a process that occurs over an area as well as over time. We have the square footage of each building, so if we add that information into the model, we can interpret our parameters as a rate of complaints per square foot per 30 days.

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]

The term \(\text{sq_foot}\) is called an exposure term. If we log the term, we can put it in \(\eta_{b,t}\):

\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b \end{align*} \]

We will also include whether there is a live-in super or not as a predictor for the number of complaints, which gives us gives us:

\[ \eta_{b,t} = \alpha + \beta \, \textrm{traps}_{b,t} + \beta_{\rm super} \textrm{live_in_super}_{b,t} + \textrm{log_sq_foot}_b \]

Add these new variables to the data list for Stan.

standata_simple$log_sq_foot <- pest_data$log_sq_foot_1e4 # log(total_sq_foot/1e4)
standata_simple$live_in_super <- pest_data$live_in_super

Stan program for Poisson multiple regression

Now we need a new Stan program that uses multiple predictors.

  • Write multiple_poisson_regression.stan
comp_multi <- stan_model('stan_programs/multiple_poisson_regression.stan')

Fit the Poisson multiple regression

fit_multi <- sampling(comp_multi, data = standata_simple,
                      refresh = 0) # turn off printed progress updates
print(fit_multi, pars = c("alpha", "beta", "beta_super"))
Inference for Stan model: multiple_poisson_regression.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha       1.22    0.01 0.21  0.80  1.08  1.23  1.36  1.63   830 1.01
beta       -0.22    0.00 0.03 -0.27 -0.24 -0.22 -0.20 -0.16   904 1.01
beta_super -0.30    0.00 0.13 -0.55 -0.38 -0.30 -0.21 -0.05   962 1.00

Samples were drawn using NUTS(diag_e) at Fri Apr 24 17:15:02 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
posterior_draws <- as.matrix(fit_multi, pars = c("alpha", "beta", "beta_super"))
y_rep <- as.matrix(fit_multi, pars = "y_rep")
ppc_dens_overlay(standata_simple$complaints, y_rep[1:200,])

This again looks like we haven’t captured the smaller counts very well, nor have we captured the larger counts.

We’re still severely underestimating the proportion of zeros in the data:

prop_zero <- function(x) mean(x == 0)
ppc_stat(y = standata_simple$complaints, yrep = y_rep, 
         stat = "prop_zero", binwidth = 0.01)

Ideally this vertical line would fall somewhere within the histogram.

We can also plot uncertainty intervals for the predicted complaints for different numbers of traps.

ppc_intervals(
  y = standata_simple$complaints, 
  yrep = y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) + 
  labs(x = "Number of traps", y = "Number of complaints")

We can see that we’ve increased the tails a bit more at the larger numbers of traps but we still have some large observed numbers of complaints that the model would consider extremely unlikely events.

Modeling count data with the Negative Binomial

When we considered modeling the data using a Poisson, we saw that the model didn’t appear to fit as well to the data as we would like. In particular the model underpredicted low and high numbers of complaints, and overpredicted the medium number of complaints. This is one indication of over-dispersion, where the variance is larger than the mean. A Poisson model doesn’t fit over-dispersed count data very well because the same parameter \(\lambda\), controls both the expected counts and the variance of these counts. The natural alternative to this is the negative binomial model:

\[ \begin{align*} \text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b} \end{align*} \]

In Stan the negative binomial mass function we’ll use is called \(\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)\) in Stan. Like the poisson_log function, this negative binomial mass function that is parameterized in terms of its log-mean, \(\eta\), but it also has a precision \(\phi\) such that

\[ \mathbb{E}[y] \, = \lambda = \exp(\eta) \]

\[ \text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi. \]

As \(\phi\) gets larger the term \(\lambda^2 / \phi\) approaches zero and so the variance of the negative-binomial approaches \(\lambda\), i.e., the negative-binomial gets closer and closer to the Poisson.

Stan program for negative-binomial regression

  • Write multiple_NB_regression.stan together
comp_multi_NB <- stan_model('stan_programs/multiple_NB_regression.stan')

Fit to data and check the fit

fit_multi_NB <- sampling(comp_multi_NB, data = standata_simple)

SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000102 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.02 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.992699 seconds (Warm-up)
Chain 1:                0.873713 seconds (Sampling)
Chain 1:                1.86641 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 4.5e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.45 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 1.04152 seconds (Warm-up)
Chain 2:                0.835047 seconds (Sampling)
Chain 2:                1.87657 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 4.3e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.43 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 1.03573 seconds (Warm-up)
Chain 3:                0.897315 seconds (Sampling)
Chain 3:                1.93304 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 9.7e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.97 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 1.05344 seconds (Warm-up)
Chain 4:                0.826121 seconds (Sampling)
Chain 4:                1.87956 seconds (Total)
Chain 4: 
# to demonstrate extracting as a list instead of using as.matrix
samps_NB <- rstan::extract(fit_multi_NB) 
str(samps_NB)
List of 8
 $ alpha     : num [1:4000(1d)] 1.05 1.39 1.43 1.88 1.42 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ beta      : num [1:4000(1d)] -0.203 -0.228 -0.243 -0.299 -0.237 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ beta_super: num [1:4000(1d)] -0.175 -0.386 -0.243 -0.451 -0.109 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ inv_phi   : num [1:4000(1d)] 0.822 0.373 0.815 0.578 0.734 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ phi       : num [1:4000(1d)] 1.22 2.68 1.23 1.73 1.36 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
 $ eta       : num [1:4000, 1:120] 0.839 0.979 0.906 0.91 0.936 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ iterations: NULL
  .. ..$           : NULL
 $ y_rep     : num [1:4000, 1:120] 4 1 1 3 3 2 1 0 2 0 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ iterations: NULL
  .. ..$           : NULL
 $ lp__      : num [1:4000(1d)] 232 230 232 233 232 ...
  ..- attr(*, "dimnames")=List of 1
  .. ..$ iterations: NULL
mcmc_hist(as.matrix(fit_multi_NB, pars = c("inv_phi", "phi")))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s look at our predictions vs. the data.

ppc_dens_overlay(y = standata_simple$complaints, yrep = samps_NB$y_rep[1:200,])

It appears that our model now captures both the number of small counts better as well as the tails.

Let’s check the proportion of zeros:

ppc_stat(y = standata_simple$complaints, yrep = samps_NB$y_rep, 
         stat = "prop_zero", binwidth = 0.01)

Check predictions by number of traps:

ppc_intervals(
  y = standata_simple$complaints, 
  yrep = samps_NB$y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) + 
  labs(x = "Number of traps (jittered)", y = "Number of complaints")

We haven’t used the fact that the data are clustered by building yet. A posterior predictive check might elucidate whether it would be a good idea to add the building information into the model.

ppc_stat_grouped(
  y = standata_simple$complaints, 
  yrep = samps_NB$y_rep, 
  group = pest_data$building_id, 
  stat = 'mean',
  binwidth = 0.2
)

We’re getting plausible predictions for most building means but some are estimated better than others and some have large uncertainties. If we explicitly model the variation across buildings we may be able to get better estimates.

Hierarchical modeling

Modeling varying intercepts for each building

Let’s add a hierarchical intercept parameter, \(\mu_b\) at the building level to our model.

\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})} \\ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \beta_{\rm super}\, {\rm super}_b + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha, \sigma_{\mu}) \]

In our Stan model, \(\mu_b\) is the \(b\)-th element of the vector \(\texttt{mu}\) which has one element per building.

One of our predictors varies only by building, so we can rewrite the above model more efficiently like so:

\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log_sq_foot}_b\\ \mu_b \sim \text{Normal}(\alpha + \beta_{\text{super}} \, \text{super}_b , \sigma_{\mu}) \]

We have more information at the building level as well, like the average age of the residents, the average age of the buildings, and the average per-apartment monthly rent so we can add that data into a matrix called building_data, which will have one row per building and four columns:

  • live_in_super
  • age_of_building
  • average_tentant_age
  • monthly_average_rent

We’ll write the Stan model like:

\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps} + \text{log_sq_foot}\\ \mu \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \,\sigma_{\mu}) \]

Prepare building data for hierarchical model

We’ll need to do some more data prep before we can fit our models. Among other things, in order to use the building variable in Stan we will need to transform it from a factor variable to an integer variable.

N_months <- length(unique(pest_data$date))
N_buildings <- length(unique(pest_data$building_id))

# Add some IDs for building and month
pest_data <- pest_data %>%
  mutate(
    building_fac = factor(building_id, levels = unique(building_id)),
    building_idx = as.integer(building_fac),
    ids = rep(1:N_months, N_buildings),
    mo_idx = lubridate::month(date) # we'll use this later
  )

# Center and rescale the building specific data
building_data <- pest_data %>%
    select(
      building_idx,
      live_in_super,
      age_of_building,
      average_tenant_age,
      monthly_average_rent
    ) %>%
    unique() %>%
    arrange(building_idx) %>%
    select(-building_idx) %>%
    scale(scale=FALSE) %>% # this centers (subtracts mean) but doesn't divide by sd
    as.data.frame() %>%
    mutate( # scale by constants to put variables on similar scales
      age_of_building = age_of_building / 10,
      average_tenant_age = average_tenant_age / 10,
      monthly_average_rent = monthly_average_rent / 1000
    ) %>%
    as.matrix()

# Make data list for Stan
standata_hier <-
  with(pest_data,
        list(complaints = complaints,
             N = length(complaints),
             traps = traps,
             log_sq_foot = log(pest_data$total_sq_foot/1e4),
             M = N_months,
             mo_idx = as.integer(as.factor(date)),
             J = N_buildings,
             K = ncol(building_data), # number of building-specific variables
             building_idx = building_idx,
             building_data = building_data
             )
        )

Compile and fit the hierarchical model

Let’s compile the model.

comp_hier_NB <- stan_model('stan_programs/hier_NB_regression.stan')

Fit the model to data.

fit_hier_NB <-
  sampling(
    comp_hier_NB,
    data = standata_hier,
    refresh = 1000, 
    # can run markov chains in parallel. you can check how many cores you 
    # have available using parallel::detectCores()
    cores = 4 
  )
Warning: There were 160 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

Diagnostics

We get a bunch of warnings from Stan about divergent transitions, which is an indication that there may be regions of the posterior that have not been explored by the Markov chains.

Divergences are discussed in more detail in the case study Diagnosing Biased Inference with Divergences, the bayesplot MCMC diagnostics vignette and A Conceptual Introduction to Hamiltonian Monte Carlo.

In this example we will see that we have divergent transitions because we need to reparameterize our model - i.e., we will retain the overall structure of the model, but transform some of the parameters so that it is easier for Stan to sample from the parameter space. Before we go through exactly how to do this reparameterization, we will first go through what indicates that this is something that reparameterization will resolve. We will go through:

  1. Examining the fitted parameter values, including the effective sample size
  2. Plots that reveal particular patterns in locations of the divergences.

Then we print the fits for the parameters that are of most interest.

print(fit_hier_NB, pars = c('sigma_mu','beta','alpha','phi','mu'))
Inference for Stan model: hier_NB_regression.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
sigma_mu  0.27    0.01 0.17  0.06  0.14  0.23  0.35  0.69   624    1
beta     -0.24    0.00 0.06 -0.36 -0.29 -0.24 -0.20 -0.13   600    1
alpha     1.36    0.02 0.43  0.56  1.07  1.34  1.66  2.21   556    1
phi       1.60    0.01 0.36  1.03  1.34  1.55  1.79  2.44  2152    1
mu[1]     1.40    0.02 0.55  0.35  1.04  1.39  1.78  2.49   657    1
mu[2]     1.34    0.02 0.53  0.33  0.97  1.33  1.71  2.39   702    1
mu[3]     1.52    0.02 0.48  0.63  1.19  1.50  1.86  2.46   676    1
mu[4]     1.56    0.02 0.48  0.65  1.22  1.54  1.88  2.51   704    1
mu[5]     1.15    0.01 0.42  0.36  0.87  1.14  1.43  2.00   783    1
mu[6]     1.26    0.02 0.48  0.35  0.92  1.25  1.61  2.21   592    1
mu[7]     1.57    0.02 0.51  0.60  1.23  1.57  1.92  2.60   708    1
mu[8]     1.33    0.01 0.43  0.50  1.04  1.31  1.61  2.19   836    1
mu[9]     1.51    0.02 0.56  0.41  1.15  1.53  1.87  2.61   678    1
mu[10]    0.92    0.01 0.36  0.22  0.66  0.91  1.15  1.64   897    1

Samples were drawn using NUTS(diag_e) at Fri Apr 24 17:17:00 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

You can see that the effective sample sizes are quite low for many of the parameters relative to the total number of samples (4000). This alone isn’t indicative of the need to reparameterize, but it indicates that we should look further at the trace plots and pairs plots. First let’s look at the traceplots to see if the divergent transitions form a pattern.

# change bayesplot color scheme (to make chains very different colors)
color_scheme_set("brewer-Spectral") 
mcmc_trace(
  # use as.array to keep the markov chains separate for trace plots
  as.array(fit_hier_NB,pars = 'sigma_mu'),
  np = nuts_params(fit_hier_NB),
  window = c(500,1000) # zoom into a window to see more clearly
)

color_scheme_set("blue")

Looks like the divergences (marked in red below the traceplot) correspond to spots where sampler gets stuck.

# assign to object so we can compare to another plot later
scatter_with_divs <- mcmc_scatter(
  as.array(fit_hier_NB),
  pars = c("mu[4]", "sigma_mu"),
  # look at sigma on log-scale (what Stan uses under the hood for positive parameters)
  transform = list("sigma_mu" = "log"), 
  size = 1,
  alpha = 0.3,
  np = nuts_params(fit_hier_NB),
  np_style = scatter_style_np(div_size = 1.5) # change style of divergences points
)
scatter_with_divs + coord_equal()

What we have here is a cloud-like shape, with most of the divergences clustering towards the bottom. We’ll see a bit later that we actually want this to look more like a funnel than a cloud, but the divergences are indicating that the sampler can’t explore the narrowing neck of the funnel.

One way to see why we should expect some version of a funnel is to look at some simulations from the prior, which we can do without MCMC and thus with no risk of sampling problems:

draw_from_prior <- function() {
  sigma_mu_prior <- abs(rnorm(1, mean = 0, sd = 1))
  mu_prior <- rnorm(1, mean = 0, sd = sigma_mu_prior)
  c("mu" = mu_prior, "log(sigma_mu)" = log(sigma_mu_prior))
}

# plot a bunch of draws
prior_draws <- t(replicate(1000, draw_from_prior()))
mcmc_scatter(prior_draws)

Of course, if the data is at all informative we shouldn’t expect the posterior to look exactly like the prior. But unless the data is incredibly informative about the parameters (and the posterior concentrates away from the narrow neck of the funnel) the sampler is going to have to confront the funnel geometry. (See the bayesplot Visual MCMC Diagnostics vignette for more on this.)

Reparameterize and recheck diagnostics

Instead, we should use the non-centered parameterization for \(\mu_b\). This involves taking standard normal random variables and shifting and scaling them to have the mean and standard deviation we want using properties of the normal distribution.

\[ \mu_{\rm raw} \sim \text{Normal}(0,1) \\ \mu = \text{mean} + \text{sd} \times \mu_{\rm raw} \\ \mu = (\alpha + \texttt{building_data} \, \zeta) + \sigma_{\mu} \mu_{\rm raw} \]

This implies \(\texttt{mu}\) is distributed \(\text{Normal}(\alpha + \texttt{building_data}\, \zeta, \sigma_\mu)\) as before, but it decouples the dependence of the density of each element of \(\texttt{mu}\) from \(\texttt{sigma_mu}\) (\(\sigma_\mu\)) and makes the geometry easier for the sampler to navigate.

To implement this in the Stan program we need to put \(\texttt{mu_raw}\) in the parameters block (with a \(\text{Normal}(0, 1)\) prior in the model block), and then define \(\texttt{mu}\) in the transformed parameters block.

transformed parameters {
  vector[J] mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
comp_hier_NB_ncp <- stan_model('stan_programs/hier_NB_regression_ncp.stan')

Fit the model to the data.

fit_hier_NB_ncp <- sampling(comp_hier_NB_ncp, 
                            data = standata_hier, 
                            refresh = 1000,
                            cores = 4, 
                            seed = 123) # demonstrating setting seed

You should have no divergences (or only a few) this time and much improved effective sample sizes:

print(fit_hier_NB_ncp, pars = c('sigma_mu','beta','alpha','phi','mu'))
Inference for Stan model: hier_NB_regression_ncp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
sigma_mu  0.23    0.01 0.18  0.01  0.10  0.19  0.32  0.68  1222    1
beta     -0.24    0.00 0.06 -0.37 -0.28 -0.24 -0.20 -0.13  3307    1
alpha     1.35    0.01 0.43  0.54  1.06  1.34  1.63  2.22  3230    1
phi       1.58    0.01 0.35  1.03  1.34  1.53  1.78  2.41  4576    1
mu[1]     1.39    0.01 0.55  0.35  1.03  1.38  1.75  2.49  3311    1
mu[2]     1.32    0.01 0.52  0.34  0.97  1.31  1.65  2.38  3234    1
mu[3]     1.49    0.01 0.49  0.59  1.15  1.47  1.81  2.48  3404    1
mu[4]     1.53    0.01 0.48  0.60  1.20  1.51  1.84  2.50  3493    1
mu[5]     1.15    0.01 0.41  0.36  0.88  1.15  1.43  1.98  3681    1
mu[6]     1.28    0.01 0.48  0.35  0.94  1.27  1.59  2.23  3142    1
mu[7]     1.55    0.01 0.51  0.58  1.20  1.55  1.89  2.60  3617    1
mu[8]     1.32    0.01 0.42  0.52  1.03  1.31  1.61  2.15  3969    1
mu[9]     1.53    0.01 0.56  0.40  1.17  1.53  1.89  2.61  3413    1
mu[10]    0.92    0.01 0.38  0.23  0.66  0.91  1.16  1.67  4032    1

Samples were drawn using NUTS(diag_e) at Fri Apr 24 17:17:53 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Let’s make the same scatter plot we made for the model with many divergences and compare:

scatter_no_divs <- mcmc_scatter(
  as.array(fit_hier_NB_ncp),
  pars = c("mu[4]", 'sigma_mu'),
  transform = list('sigma_mu' = "log"),
  size = 1,
  alpha = 0.3,
  np = nuts_params(fit_hier_NB_ncp)
)

bayesplot_grid(
  scatter_with_divs, scatter_no_divs,
  grid_args = list(ncol = 2), 
  ylim = c(-10, 1), 
  subtitles = c("Centered parameterization", 
                "Non-centered parameterization")
)

Comparing the plots, we see that the divergences were a sign that there was part of the posterior that the chains were not exploring.

The marginal PPC plot, again.

y_rep <- as.matrix(fit_hier_NB_ncp, pars = "y_rep")
ppc_dens_overlay(standata_hier$complaints, y_rep[1:200,])

This looks quite nice.

If we’ve captured the building-level means well, then the distribution of predicted means by building should match well with the observed means of the quantity of building complaints by month.

ppc_stat_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = 'mean',
  binwidth = 0.5
)

We weren’t doing terribly with the building-specific means before, but now they are all estimated much more precisely. The model is also able to do a decent job estimating the probability of zero complaints:

prop_zero <- function(x) mean(x == 0)
ppc_stat(
  y = standata_hier$complaints,
  yrep = y_rep,
  stat = prop_zero,
  binwidth = 0.02
)

# plot separately for each building
ppc_stat_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  group = pest_data$building_id,
  stat = prop_zero,
  binwidth = 0.1
)

Check quantiles of predictions:

q90 <- function(x) quantile(x, probs = 0.9)
ppc_stat(
  y = standata_hier$complaints,
  yrep = y_rep,
  stat = "q90",
  binwidth = 1
)

Predictions by number of traps:

ppc_intervals_grouped(
  y = standata_hier$complaints,
  yrep = y_rep,
  x = standata_hier$traps + rnorm(standata_hier$N, 0, 0.2), # jitter
  group = standata_hier$building_idx
  # difference between grouping by pest_data$building_id and standata_hier$building_idx
  # is that the latter is consecutive from 1 to N_buildings to make things easier 
  # to code in Stan
) +
  labs(
    x = "Number of traps", 
    y = "Predicted complaints", 
    title = "Traps vs predicted complaints by building"
  )

Varying intercepts and varying slopes

We have some new data that extends the number of time points for which we have observations for each building. This will let us explore how to expand the model a bit more with varying slopes in addition to the varying intercepts and later also model temporal variation.

standata_hier_long <- readRDS('data/standata_hier_long.RDS')

Perhaps if the levels of complaints differ by building, the coefficient for the effect of traps on building does too. We can add this to our model and observe the fit.

\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})}\\ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu}) \\ \kappa_b \sim \text{Normal}(\beta + \texttt{building_data} \, \gamma, \sigma_{\kappa}) \]

Let’s compile the model.

comp_hier_NB_slopes <- stan_model('stan_programs/hier_NB_regression_ncp_slopes.stan')

Fit the model to data and extract the posterior draws needed for our posterior predictive checks.

fit_hier_NB_slopes <-
  sampling(
    comp_hier_NB_slopes,
    data = standata_hier_long,
    chains = 4, 
    cores = 4,
    control = list(adapt_delta = 0.95)
  )
Warning: There were 8 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems

To see if the model infers building-to-building differences in, we can plot a histogram of our marginal posterior distribution for sigma_kappa.

mcmc_hist(
  as.matrix(fit_hier_NB_slopes, pars = "sigma_kappa"),
  binwidth = 0.005
)

print(fit_hier_NB_slopes, pars = c('kappa','beta','alpha','phi','sigma_mu','sigma_kappa','mu'))
Inference for Stan model: hier_NB_regression_ncp_slopes.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
kappa[1]    -0.02    0.00 0.08 -0.15 -0.07 -0.03  0.02  0.15  1291    1
kappa[2]    -0.43    0.00 0.10 -0.64 -0.49 -0.42 -0.36 -0.25  1856    1
kappa[3]    -0.59    0.00 0.10 -0.80 -0.66 -0.59 -0.52 -0.39  4845    1
kappa[4]    -0.23    0.00 0.07 -0.37 -0.27 -0.23 -0.19 -0.10  3051    1
kappa[5]    -0.60    0.00 0.10 -0.79 -0.67 -0.60 -0.54 -0.42  5059    1
kappa[6]    -0.44    0.00 0.10 -0.66 -0.50 -0.43 -0.37 -0.26  3434    1
kappa[7]    -0.31    0.00 0.07 -0.44 -0.36 -0.31 -0.27 -0.18  4942    1
kappa[8]    -0.24    0.00 0.15 -0.57 -0.33 -0.24 -0.14  0.04  2228    1
kappa[9]     0.08    0.00 0.06 -0.03  0.04  0.08  0.12  0.19  4921    1
kappa[10]   -0.73    0.00 0.15 -1.01 -0.83 -0.73 -0.63 -0.41  1793    1
beta        -0.35    0.00 0.06 -0.47 -0.39 -0.35 -0.31 -0.22  1410    1
alpha        1.46    0.01 0.30  0.87  1.26  1.46  1.66  2.03  3495    1
phi          1.61    0.00 0.18  1.28  1.48  1.60  1.73  2.00  4543    1
sigma_mu     0.50    0.01 0.40  0.02  0.18  0.40  0.73  1.48   752    1
sigma_kappa  0.13    0.00 0.09  0.03  0.07  0.11  0.17  0.37   626    1
mu[1]        0.31    0.02 0.71 -1.36 -0.08  0.41  0.80  1.47  1282    1
mu[2]        1.70    0.01 0.54  0.77  1.33  1.66  2.02  2.87  1702    1
mu[3]        2.14    0.00 0.33  1.51  1.91  2.14  2.36  2.79  5076    1
mu[4]        1.54    0.01 0.51  0.54  1.22  1.54  1.85  2.59  3080    1
mu[5]        2.40    0.01 0.43  1.59  2.10  2.39  2.69  3.28  5218    1
mu[6]        1.92    0.01 0.38  1.22  1.67  1.89  2.14  2.75  3279    1
mu[7]        2.69    0.00 0.25  2.21  2.52  2.68  2.86  3.20  4599    1
mu[8]       -0.45    0.02 0.96 -2.24 -1.07 -0.48  0.15  1.54  2268    1
mu[9]        0.25    0.01 0.57 -0.85 -0.12  0.25  0.62  1.37  5090    1
mu[10]       1.87    0.03 1.06 -0.60  1.28  1.99  2.59  3.68  1377    1

Samples were drawn using NUTS(diag_e) at Fri Apr 24 17:19:32 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

While the model can’t specifically rule out zero from the posterior, it does have mass at small non-zero numbers, so we will leave in the hierarchy over \(\texttt{kappa}\). Plotting the marginal data density again, we can see the model still looks well calibrated.

y_rep <- as.matrix(fit_hier_NB_slopes, pars = "y_rep")
ppc_dens_overlay(
  y = standata_hier_long$complaints,
  yrep = y_rep[1:200,]
)

Check quantiles of predictions:

q90 <- function(x) quantile(x, probs = 0.9)
ppc_stat(
  y = standata_hier_long$complaints,
  yrep = y_rep,
  stat = "q90"
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Time varying effects and structured priors

We haven’t looked at how cockroach complaints change over time. Let’s look at plots over time. Each panel below is a month and we can see that we’re making basically the same predictions for each month since we haven’t encoding any time structure into the model:

y_rep <- as.matrix(fit_hier_NB_slopes, pars = "y_rep")
select_vec <- which(standata_hier_long$mo_idx %in% 1:12)
ppc_stat_grouped(
  y = standata_hier_long$complaints[select_vec],
  yrep = y_rep[,select_vec],
  group = standata_hier_long$mo_idx[select_vec],
  stat = 'mean'
) + xlim(0, 11)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We might augment our model with a log-additive monthly effect, \(\texttt{mo}_t\).

\[ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \texttt{mo}_t + \text{log_sq_foot}_b \]

We have complete freedom over how to specify the prior for \(\texttt{mo}_t\). There are several competing factors for how the number of complaints might change over time. It makes sense that there might be more roaches in the environment during the summer, but we might also expect that there is more roach control in the summer as well. Given that we’re modeling complaints, maybe after the first sighting of roaches in a building, residents are more vigilant, and thus complaints of roaches would increase.

This can be a motivation for using an autoregressive prior for our monthly effects. The model is as follows:

\[ \texttt{mo}_t \sim \text{Normal}(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}) \\ \equiv \\ \texttt{mo}_t = \rho \, \texttt{mo}_{t-1} +\epsilon_t , \quad \epsilon_t \sim \text{Normal}(0, \sigma_\texttt{mo}) \\ \quad \rho \in [-1,1] \]

This equation says that the monthly effect in month \(t\) is directly related to the last month’s monthly effect. Given the description of the process above, it seems like there could be either positive or negative associations between the months, but there should be a bit more weight placed on positive \(\rho\)s, so we’ll put an informative prior that pushes the parameter \(\rho\) towards 0.5.

Before we write our prior, however, we have a problem: Stan doesn’t implement any densities that have support on \([-1,1]\). We can use variable transformation of a raw variable defined on \([0,1]\) to to give us a density on \([-1,1]\). Specifically,

\[ \rho_{\text{raw}} \in [0, 1] \\ \rho = 2 \times \rho_{\text{raw}} - 1 \]

Then we can use a beta prior on \(\rho_\text{raw}\) to push our estimate of \(\rho\) towards 0.5.

For example a rho_raw ~ beta(10, 5) prior would look like:

suppressPackageStartupMessages(library(ggformula))
gf_dist("beta", shape1 = 10, shape2 = 5) + xlab("rho_raw") 

rho_raw <- rbeta(1000, 10, 5)
rho <- 2 * rho_raw - 1
rho_priors <- cbind(rho_raw, rho)
mcmc_hist(rho_priors, facet_args = list(nrow = 2)) + xlim(-1, 1)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

One further wrinkle is that we have a prior for \(\texttt{mo}_t\) that depends on \(\texttt{mo}_{t-1}\). That is, we are working with the conditional distribution of \(\texttt{mo}_t\) given \(\texttt{mo}_{t-1}\). But what should we do about the prior for \(\texttt{mo}_1\), for which we don’t have a previous time period in the data?

We need to work out the marginal distribution of the first observation. Thankfully we can use the fact that AR models are stationary, so \(\text{Var}(\texttt{mo}_t) = \text{Var}(\texttt{mo}_{t-1})\) and \(\mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\texttt{mo}_{t-1})\) for all \(t\). Therefore the marginal distribution of \(\texttt{mo}_1\) is the same as the marginal distribution of any \(\texttt{mo}_t\).

First we derive the marginal variance of \(\texttt{mo}_{t}\).

\[ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \texttt{mo}_{t-1} + \epsilon_t) \\ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \texttt{mo}_{t-1}) + \text{Var}(\epsilon_t) \] where the second line holds by independence of \(\epsilon_t\) and \(\epsilon_{t-1})\). Then, using the fact that \(Var(cX) = c^2Var(X)\) for a constant \(c\) and the fact that, by stationarity, \(\textrm{Var}(\texttt{mo}_{t-1}) = \textrm{Var}(\texttt{mo}_{t})\), we then obtain:

\[ \text{Var}(\texttt{mo}_t)= \rho^2 \text{Var}( \texttt{mo}_{t}) + \sigma_\texttt{mo}^2 \\ \text{Var}(\texttt{mo}_t) = \frac{\sigma_\texttt{mo}^2}{1 - \rho^2} \]

For the mean of \(\texttt{mo}_t\) things are a bit simpler:

\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1} + \epsilon_t) \\ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + \mathbb{E}(\epsilon_t) \\ \]

Since \(\mathbb{E}(\epsilon_t) = 0\) by assumption we have

\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + 0\\ \mathbb{E}(\texttt{mo}_t) = \rho \, \mathbb{E}(\texttt{mo}_{t}) \\ \mathbb{E}(\texttt{mo}_t) - \rho \mathbb{E}(\texttt{mo}_t) = 0 \\ \mathbb{E}(\texttt{mo}_t) = 0/(1 - \rho) \]

which for \(\rho \neq 1\) yields \(\mathbb{E}(\texttt{mo}_{t}) = 0\).

We now have the marginal distribution for \(\texttt{mo}_{t}\), which, in our case, we will use for \(\texttt{mo}_1\). The full AR(1) specification is then:

\[ \texttt{mo}_1 \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t \sim \text{Normal}\left(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}\right) \forall t > 1 \]

comp_hier_NB_mos <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mos.stan')
fit_hier_NB_mos <-
  sampling(
    comp_hier_NB_mos,
    data = standata_hier_long,
    cores = 4,
    control = list(adapt_delta = 0.95)
  )

In the interest of brevity, we won’t go on expanding the model, though we certainly could. What other information would help us understand the data generating process better? What other aspects of the data generating process might we want to capture that we’re not capturing now?

As usual, we run through our posterior predictive checks.

y_rep <- as.matrix(fit_hier_NB_mos, pars = "y_rep")
ppc_dens_overlay(
  y = standata_hier_long$complaints,
  yrep = y_rep[1:200,]
)

y_rep <- as.matrix(fit_hier_NB_mos, pars = "y_rep")
select_vec <- which(standata_hier_long$mo_idx %in% 1:12)
ppc_stat_grouped(
  y = standata_hier_long$complaints[select_vec],
  yrep = y_rep[,select_vec],
  group = standata_hier_long$mo_idx[select_vec],
  stat = 'mean'
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see, our monthly random intercept has captured a monthly pattern across all the buildings. We can also compare the prior and posterior for the autoregressive parameter to see how much we’ve learned. Here are two different ways of comparing the prior and posterior visually:

# 1) compare draws from prior and draws from posterior
rho_draws <- cbind(
  2 * rbeta(4000, 10, 5) - 1, # draw from prior
  as.matrix(fit_hier_NB_mos, pars = "rho")
)
colnames(rho_draws) <- c("prior", "posterior")
mcmc_hist(rho_draws, freq = FALSE, binwidth = 0.025,
          facet_args = list(nrow = 2)) + xlim(-1, 1)

print(fit_hier_NB_mos, pars = c('rho','sigma_mu','sigma_kappa','gamma'))
Inference for Stan model: hier_NB_regression_ncp_slopes_mos.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
rho          0.78    0.00 0.08  0.61  0.73  0.79  0.84  0.91  1257    1
sigma_mu     0.30    0.01 0.23  0.01  0.12  0.25  0.43  0.85  1308    1
sigma_kappa  0.09    0.00 0.05  0.01  0.05  0.08  0.11  0.21  1080    1
gamma[1]    -0.18    0.00 0.11 -0.39 -0.25 -0.18 -0.12  0.03  2275    1
gamma[2]     0.12    0.00 0.07 -0.02  0.07  0.12  0.16  0.27  2086    1
gamma[3]     0.10    0.00 0.15 -0.18  0.02  0.10  0.19  0.39  2291    1
gamma[4]    -0.01    0.00 0.06 -0.14 -0.04  0.00  0.03  0.11  2420    1

Samples were drawn using NUTS(diag_e) at Fri Apr 24 17:21:36 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
ppc_intervals(
  y = standata_hier_long$complaints,
  yrep = y_rep,
  x = standata_hier_long$traps + rnorm(standata_hier_long$N, 0, 0.2) # jitter
) +
  labs(x = "Number of traps", 
       y = "Number of complaints")

It looks as if our model finally generates a reasonable posterior predictive distribution for all numbers of traps, and appropriately captures the tails of the data generating process.

Using our model: loss forecasts for decision making

Our model seems to be fitting well, so now we will go ahead and use the model to help us make a decision about how many traps to put in our buildings. This decision will balance the benefit of using more traps (fewer complaints and calls to the exterminator) with the cost of setting up and maintaining the traps.

We will generate forecast curves for the buildings for a period of 12 months, which will let us quantify our uncertainty about the revenue (in this case really just cost) projected for any hypothetical number of traps we could use in each building. That is, we will ask the question: over the next year, how much would it cost each building if they used 0 traps, or 1 trap, or 2 traps, etc.?

To do this we first need to predict the number of complaints over 12 months in each building for each hypothetical number of traps (we’ll limit it to a max of 20 traps).

comp_forecast <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mos_predict.stan')
standata_hier_long$M_forward <- 12

forecast_model <- sampling(
  comp_forecast,
  data = standata_hier_long,
  cores = 4,
  control = list(adapt_delta = 0.95)
)
Warning: There were 5 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
# extract as a list instead of matrix for convenience below
samps_forecast <- rstan::extract(forecast_model)

# get draws of predicted complaints
predicted_complaints <- samps_forecast$y_pred

# draws x buildings x number of traps
dim(predicted_complaints)
[1] 4000   10   21

Factoring in the various costs for the client

Cost of exterminator

The client has a policy that for every 10 complaints, they’ll call an exterminator costing them $100, so that’ll amount to $10 per complaint:

total_losses <- -10 * predicted_complaints

Cost of materials and labor for maintaining traps

Another key input to our analysis will be the cost of installing traps (really bait stations). We’re simulating the number of complaints we receive over the course of a year, so we need to understand the cost associated with maintaining each bait station over the course of a year. There’s the cost attributed to the raw bait station, which is the plastic housing and the bait material, a peanut-buttery substance that’s injected with insecticide. The cost of maintaining one bait station for a year plus monthly replenishment of the bait material is about $20.

# materials cost
N_traps <- 20
costs <- 20 * (0:N_traps)

We’ll also need labor for maintaining the bait stations, which need to be serviced every two months. If there are fewer than five traps, our in-house maintenance staff can manage the stations (about one hour of work every two months at $20/hour), but above five traps we need to hire outside pest control to help out. They’re a bit more expensive, so we’ve put their cost at $30 / hour. Each five traps should require an extra person-hour of work, so that’s factored in as well. The marginal person-person hours above five traps are at the higher pest-control-labor rate.

# labor costs
N_months_forward <- 12
N_months_labor <- N_months_forward / 2
hourly_rate_low <- 20
hourly_rate_high <- 30
costs <- costs +
  (0:N_traps < 5 & 0:N_traps > 0) * (N_months_labor * hourly_rate_low) +
  (0:N_traps >= 5 & 0:N_traps < 10) * (N_months_labor * (hourly_rate_low + 1 * hourly_rate_high)) +
  (0:N_traps >= 10 & 0:N_traps < 15) * (N_months_labor * (hourly_rate_low + 2 * hourly_rate_high)) +
  (0:N_traps >= 15) * (N_months_labor * (hourly_rate_low + 3 * hourly_rate_high))
print(costs)
 [1]    0  140  160  180  200  400  420  440  460  480  680  700  720  740  760
[16]  960  980 1000 1020 1040 1060

We can now include the materials and labor into the total revenue lost:

total_losses <- sweep(total_losses, MARGIN = 3, 
                      STATS = costs, FUN = '-')

Plotting forecasts

We can now plot curves with number of traps on the x-axis and revenue forecasts and uncertainty intervals on the y-axis.

# calculate medians and central intervals (for example 90% intervals)
median_loss <- t(apply(total_losses, c(2, 3), median))
lower_loss <- t(apply(total_losses, c(2, 3), quantile, 0.05))
upper_loss <- t(apply(total_losses, c(2, 3), quantile, 0.95))

buildings <- seq(1, ncol(median_loss))
traps <- seq(0, nrow(median_loss) - 1)

loss_df <-
  data.frame(
    loss = as.vector(median_loss),
    lower = as.vector(lower_loss),
    upper = as.vector(upper_loss),
    traps = rep(traps, times = max(buildings)),
    building = rep(buildings, each = max(traps) + 1)
  )

ggplot(data = loss_df, aes(x = traps, y = loss)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue") +
  geom_line() + 
  labs(x = "Number of traps", y = "Revenue") +
  facet_wrap(~ building, scales = 'fixed', ncol = 3) 


Left as an exercise for the reader (unless we have time):

  • How would we make a forecast for a new building?

  • If we wanted to maximize expected revenue (in this case it will be negative since we’re just dealing with costs/losses), we can take expectations at each station count for each building, and choose the trap number that maximizes expected revenue for that building. This will be called a maximum revenue strategy. How can we generate the distribution of portfolio revenue (i.e. the sum of revenue across all the buildings) under the maximum revenue strategy from the predictions we already have?

Gaussian process instead of AR(1)

Joint density for AR(1) process

We can derive the joint distribution for the AR(1) process before we move to the Gaussian process (GP) which will give us a little more insight into what a GP is. Remember that we’ve specified the AR(1) prior as:

\[ \begin{aligned} \texttt{mo}_1 & \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t & \sim \text{Normal}\left(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}\right) \forall t > 1 \end{aligned} \] Rewriting our process in terms of the errors will make the derivation of the joint distribution clearer

\[ \begin{aligned} \texttt{mo}_1 & \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t & = \rho \, \texttt{mo}_{t-1} + \sigma_\texttt{mo}\epsilon_t \\ \epsilon_t & \sim \text{Normal}\left(0, 1\right) \end{aligned} \]

Given that our first term \(\texttt{mo}_1\) is normally distributed, and subsequent terms are sums of normal random variables, this means that jointly the vector mo is multivariate normal.

More formally, if we have a vector \(x \in \mathbb{R}^M\) which is multivariate normal, \(x \sim \text{MultiNormal}(0, \Sigma)\), and we left-multiply \(x\) by a nonsingular matrix \(L \in \mathbb{R}^{M\times M}\), then \(y = Lx \sim \text{MultiNormal}(0, L\Sigma L^T)\). We can use this fact to show that our vector mo is jointly multivariate normal.

Just as before with the noncentered parameterization, we’ll take a vector \(\texttt{mo_raw} \in \mathbb{R}^M\) (in which each element is univariate \(\text{Normal}(0,1)\)) and transform it into mo. But instead of doing this with scalar transformations like we did previously (e.g., in the section Time varying effects and structured priors) we will now do it with linear algebra operations.

The trick is that by specifying each element of \(\texttt{mo_raw}\) to be distributed \(\text{Normal}(0,1)\) we are implicitly defining \(\texttt{mo_raw} \sim \text{MultiNormal}(0, I_M)\), where \(I_M\) is the \(M \times M\) identity matrix. Then we do a linear transformation using a matrix \(L\) and assign the result to mo, i.e., \(\texttt{mo} = L\times\texttt{mo_raw}\). This gives us \(\texttt{mo} \sim \text{MultiNormal}(0, LI_M L^T)\) and \(LI_M L^T = LL^T\).

Consider the case where we have three elements in mo and we want to figure out the form for \(L\).

The first element of mo is fairly straightforward, because it mirrors our earlier parameterization of the AR(1) prior. The only difference is that we’re explicitly adding the last two terms of mo_raw into the equation so we can use matrix algebra for our transformation. \[ \texttt{mo}_1 = \frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + 0 \times \texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3\\ \] The second element is a bit more complicated:

\[ \begin{aligned} \texttt{mo}_2 & = \rho \texttt{mo}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\ & = \rho \left(\frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1\right) + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\[5pt] & = \frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\ \end{aligned} \]

While the third element will involve all three terms \[ \begin{aligned} \texttt{mo}_3 & = \rho \, \texttt{mo}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ & = \rho \left(\frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2\right) + \sigma_{\texttt{mo}} \texttt{mo_raw}_3 \\[5pt] & = \frac{\rho^2 \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \rho \, \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ \end{aligned} \]

Writing this all together:

\[ \begin{aligned} \texttt{mo}_1 & = \frac{\sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + 0 \times \texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3\\[3pt] \texttt{mo}_2 & = \frac{\rho \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + 0 \times \texttt{mo_raw}_3 \\[3pt] \texttt{mo}_3 & = \frac{\rho^2 \sigma_{\texttt{mo}}}{\sqrt{1 - \rho^2}} \times \texttt{mo_raw}_1 + \rho \, \sigma_{\texttt{mo}}\,\texttt{mo_raw}_2 + \sigma_{\texttt{mo}}\,\texttt{mo_raw}_3 \\ \end{aligned} \] Separating this into a matrix of coefficients \(L\) and the vector mo_raw:

\[ \texttt{mo} = \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & 0 & 0 \\ \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \sigma_\texttt{mo} & 0 \\ \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \,\sigma_\texttt{mo} & \sigma_\texttt{mo} \end{bmatrix} \times \texttt{mo_raw} \]

If we multiply \(L\) on the right by its transpose \(L^T\), we’ll get expressions for the covariance matrix of our multivariate random vector mo:

\[ \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & 0 & 0 \\ \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \sigma_\texttt{mo} & 0 \\ \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \,\sigma_\texttt{mo} & \sigma_\texttt{mo} \end{bmatrix} \times \begin{bmatrix} \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho \sigma_\texttt{mo} / \sqrt{1 - \rho^2} & \rho^2 \sigma_\texttt{mo} / \sqrt{1 - \rho^2} \\ 0 & \sigma_\texttt{mo} & \rho \,\sigma_\texttt{mo} \\ 0 & 0 & \sigma_\texttt{mo} \end{bmatrix} \]

which results in:

\[ \begin{bmatrix} \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho^2 \, \sigma^2_\texttt{mo} / (1 - \rho^2)\\ \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) \\ \rho^2 \sigma^2_\texttt{mo} / (1 - \rho^2) & \rho \, \sigma^2_\texttt{mo} / (1 - \rho^2) & \sigma^2_\texttt{mo} / (1 - \rho^2) \end{bmatrix} \] We can simplify this result by dividing the matrix by \(\sigma^2_\texttt{mo} / (1 - \rho^2)\) to get

\[ \begin{bmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{bmatrix} \]

This should generalize to higher dimensions pretty easily.

We can replace the Stan code defining mo in stan_programs/hier_NB_regression_ncp_slopes_mos.stan with the following:

vector[M] mo;
{
  matrix[M,M] A = rep_matrix(0, M, M);
  A[1,1] = sigma_mo / sqrt(1 - rho^2);
  for (m in 2:M) {
    A[m,1] = rho^(m-1) * sigma_mo / sqrt(1 - rho^2);
  }
  for (m in 2:M) {
    A[m,m] = sigma_mo;
    for (i in (m + 1):M) {
      A[i,m] = rho^(i-m) * sigma_mo;
    }
  }
  mo = A * mo_raw;
}

Cholesky decomposition

If we only knew the covariance matrix of our process, say a matrix called \(\Sigma\), and we had a way of decomposing \(\Sigma\) into \(L L^T\) we wouldn’t need to write out the equation for the vector. Luckily, there is a matrix decomposition called the Cholesky decomposition that does just that. The Stan function for the composition is called cholesky_decompose. Instead of writing out the explicit equation we can just do the following

vector[M] mo;
{
  mo = cholesky_decompose(Sigma) * mo_raw;
}

provided we’ve defined Sigma appropriately elsewhere in the transformed parameters block. Note that the matrix \(L\) is lower-triangular (i.e. all elements in the upper right triangle of the matrix are zero).

We’ve already derived the covariance matrix Sigma for the three-dimensional AR(1) process above by explicitly calculating \(L L^T\), but we can do so using the rules of covariance and the way our process is defined. We already know that each element of \(\texttt{mo}_t\) has marginal variance \(\sigma^2_\texttt{mo} / (1- \rho^2)\), but we don’t know the covariance of \(\texttt{mo}_t\) and \(\texttt{mo}_{t+h}\). We can do so recursively. First we derive the covariance for two elements of \(\texttt{mo}_t\) separated by one month:

\[ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \text{Cov}(\rho \, \texttt{mo}_{t} + \sigma_\texttt{mo}\epsilon_{t+1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \rho \text{Cov}(\texttt{mo}_{t},\texttt{mo}_{t}) + \sigma_\texttt{mo}\text{Cov}(\epsilon_{t+1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+1},\texttt{mo}_{t}) = \rho \text{Var}(\texttt{mo}_t) + 0 \]

Then we define the covariance for \(\text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t})\) in terms of \(\text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t})\)

\[ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \text{Cov}(\rho \, \texttt{mo}_{t+h-1} + \sigma_\texttt{mo}\epsilon_{t+h},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \, \text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t}) \\ \] Which we can use to recursively get at the covariance we need:

\[ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \, \text{Cov}(\texttt{mo}_{t+h-1},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho \,( \rho \, \text{Cov}(\texttt{mo}_{t+h-2},\texttt{mo}_{t}) )\\ \dots \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho^h \, \text{Cov}(\texttt{mo}_{t},\texttt{mo}_{t}) \\ \text{Cov}(\texttt{mo}_{t+h},\texttt{mo}_{t}) = \rho^h \, \sigma_\texttt{mo}^2/(1 - \rho^2) \\ \]

One way to write this in Stan code is:

vector[M] mo;
{
  matrix[M,M] Sigma;
  for (m in 1:M) {
    Sigma[m,m] = 1.0;
    for (i in (m + 1):M) {
      Sigma[i,m] = rho^(i - m);
      Sigma[m,i] = Sigma[i,m];
    }
  }
  Sigma = Sigma * sigma_mo^2 / (1 - rho^2);
  mo = cholesky_decompose(Sigma) * mo_raw;
}

Extension to Gaussian processes

The prior we defined for mo is strictly speaking a Gaussian process. It is a stochastic process that is distributed as jointly multivariate normal for any finite value of \(M\). Formally, we could write the above prior for mo like so:

\[ \begin{aligned} \sigma_\texttt{mo} & \sim \text{Normal}(0, 1) \\ \rho & \sim \text{GenBeta}(-1,1,10, 5) \\ \texttt{mo}_t & \sim \text{GP}\left( 0, K(t | \sigma_\texttt{mo},\rho) \right) \\ \end{aligned} \] The notation \(K(t | \sigma_\texttt{mo},\rho)\) defines the covariance matrix of the process over the domain \(t\), which is months.

In other words:

\[ \text{Cov}(\texttt{mo}_t,\texttt{mo}_{t+h}) = k(t, t+h | \sigma_\texttt{mo}, \rho) \]

We’ve already derived the covariance for our process. What if we want to use a different definition of Sigma?

As the above example shows defining a proper covariance matrix will yield a proper multivariate normal prior on a parameter. We need a way of defining a proper covariance matrix. These are symmetric positive definite matrices. It turns out there is a class of functions that define proper covariance matrices, called kernel functions. These functions are applied elementwise to construct a covariance matrix, \(K\):

\[ K_{[t,t+h]} = k(t, t+h | \theta) \] where \(\theta\) are the hyperparameters that define the behavior of covariance matrix.

One such function is called the exponentiated quadratic function, and it happens to be implemented in Stan as cov_exp_quad. The function is defined as:

\[ \begin{aligned} k(t, t+h | \theta) & = \alpha^2 \exp \left( - \dfrac{1}{2\ell^2} ((t+h) - t)^2 \right) \\ & = \alpha^2 \exp \left( - \dfrac{h^2}{2\ell^2} \right) \end{aligned} \]

The exponentiated quadratic kernel has two components to theta, \(\alpha\), the marginal standard deviation of the stochastic process \(f\), and \(\ell\), the process length-scale.

The length-scale defines how quickly the covariance decays between time points, with large values of \(\ell\) yielding a covariance that decays slowly, and with small values of \(\ell\) yielding a covariance that decays rapidly. It can be seen interpreted as a measure of how nonlinear the mo process is in time.

The marginal standard deviation defines how large the fluctuations are on the output side, which in our case is the number of roach complaints per month across all buildings. It can be seen as a scale parameter akin to the scale parameter for our building-level hierarchical intercept, though it now defines the scale of the monthly deviations.

This kernel’s defining quality is its smoothness; the function is infinitely differentiable. That will present problems for our example, but if we add some noise the diagonal of our covariance matrix, the model will fit well.

\[ k(t, t+h | \theta) = \alpha^2 \exp \left( - \dfrac{h^2}{2\ell^2} \right) + \text{if } h = 0, \, \sigma^2_\texttt{noise} \text{ else } 0 \]

Compiling the GP model

comp_hier_NB_gp <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mos_gp.stan')

Fitting the GP model to data

fit_hier_NB_gp <- sampling(comp_hier_NB_gp, data = standata_hier_long, chains = 4, cores = 4, control = list(adapt_delta = 0.9))
Warning: There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.9 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
samps_gp <- rstan::extract(fit_hier_NB_gp)

Examining the fit

Let’s look at the prior vs. posterior for the GP length scale parameter:

length_scale_draws <- cbind(
  prior = rgamma(4000, 10, 2),
  posterior = samps_gp$gp_len
)
mcmc_areas(length_scale_draws)

From the plot above it only looks like we learned a small amount, however we can see a bigger difference between the prior and posterior if we consider how much we learned about the ratio of sigma_gp to the length scale gp_len:

noise_to_length_scale_ratio_draws <- cbind(
  prior = abs(rnorm(4000)) / rgamma(4000, 10, 2),
  posterior = samps_gp$sigma_gp / samps_gp$gp_len
)
mcmc_areas(noise_to_length_scale_ratio_draws)

This is a classic problem with Gaussian processes. Marginally, the length-scale parameter isn’t very well-identified by the data, but jointly the length-scale and the marginal standard deviation are well-identified.

And let’s compare the estimates for the time varying parameters between the AR(1) and GP. In this case the posterior mean of the time trend is essentially the same for the AR(1) and GP priors but the 50% uncertainty intervals are narrower for the AR(1):

# visualizing 50% intervals
mo_ar_intervals <- mcmc_intervals_data(as.matrix(fit_hier_NB_mos, pars = "mo"), prob = 0.5)
mo_gp_intervals <- mcmc_intervals_data(as.matrix(fit_hier_NB_gp, pars = "gp"), prob = 0.5)
plot_data <- bind_rows(mo_ar_intervals, mo_gp_intervals)
plot_data$prior <- factor(rep(c("AR1", "GP"), each = 36), levels = c("GP", "AR1"))
plot_data$time <- rep(1:36, times = 2)

ggplot(plot_data, aes(x = time, y = m, ymin = l, ymax = h, fill = prior)) +
  geom_ribbon(alpha = 2/3)

The way we coded the GP also lets us plot a decomposition of the GP into a monthly noise component (mo_noise in the Stan code) and the underlying smoothly varying trend (gp_exp_quad in the Stan code):

# visualizing 50% intervals
mo_noise_intervals <- mcmc_intervals_data(as.matrix(fit_hier_NB_gp, pars = "mo_noise"), prob = 0.5)
gp_exp_quad_intervals <- mcmc_intervals_data(as.matrix(fit_hier_NB_gp, pars = "gp_exp_quad"), prob = 0.5)

plot_data <- bind_rows(mo_noise_intervals, gp_exp_quad_intervals)
plot_data$time <- rep(1:36, times = 2)
plot_data$term <- factor(rep(c("Monthly Noise", "Smooth Trend"), each = 36),
                         levels = c("Smooth Trend", "Monthly Noise"))

ggplot(plot_data, aes(x = time, y = m, ymin = l, ymax = h, fill = term)) +
  geom_ribbon(alpha = 0.5) +
  geom_line(aes(color = term), size = 0.5) +
  ylab(NULL)